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Abstract:  Low-cost  fault  tolerance  requires  careful 
combination  of  fault  tolerance  techniques  across  all  levels 
of  the  system  stack.  We  describe  a  systematic  framework, 
applicable  to  simple  and  complex  cores  as  well  as 
specialized  accelerators,  to  explore  this  cross-layer  design 
space  in  terms  of  area,  power,  and  performance  costs,  and 
present  illustrative  results. 
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Introduction 

It  is  clear  that  at  recent  technology  nodes,  and  moving 
forward,  even  commodity  systems  are  increasingly 
susceptible  to  hardware  failures.  We  show  the  benefits  of 
combining  techniques  across  layers,  from  circuit-level  to 
application-level,  which  allows  the  designer  to  select  a 
desired  fault  tolerance  with  much  lower  overhead.  We 
present  a  systematic  methodology  and  a  new  framework 
that  allows  for  quick  exploration  and  analysis  of  cross-layer 
resilience.  We  demonstrate  the  effectiveness  of  our 
framework  using  a  wide  range  of  resilience  techniques: 
from  circuit-,  logic-,  and  architecture-level  techniques  to 
software-implemented  and  application-specific  resilience. 

Related  Work 

There  has  been  significant  research  into  various  protection 
schemes  across  all  system  layers  that  help  improve  overall 
system  reliability.  The  challenge  is  determining  how  to 
combine  and  integrate  these  techniques.  Prior  work  on  fault 
tolerance  has  typically  proposed  techniques  that  target  a 
single  layer  of  the  system  stack.  Even  prior  work  showing 
how  to  combine  prior  techniques  to  reduce  overhead  (e.g., 
[14])  still  operates  within  a  single  layer  of  the  system  stack. 

Unfortunately,  very  little  literature  exists  on  such  cross¬ 
layer  tradeoffs.  Some  prior  work  (e.g.,  [2,  4])  has  suggested 
the  need  for  cross-layer  integration  of  resilience  techniques, 
but  ours  is  the  first  truly  cross-layer  framework  to 
systematically  evaluate  area,  power,  and  performance 
overheads  under  various  reliability  constraints. 

Methodology  and  Framework 

A  fundamental  difference  of  our  methodology  to  past 
practices  is  that  we  apply  a  systematic  approach  to  cross¬ 
layer  resilience  based  on  in-depth  analysis  rather  than 
designer  intuition  to  guide  cross-layer  combinations. 
Furthermore,  our  new  framework  (Fig.  1)  allows  us  to 


quickly  explore,  evaluate,  and  optimize  cross-layer 
resilience.  The  main  components  of  the  framework  are  a 
very  rich  resilience  library  that  incorporates  resilience 
techniques  from  across  the  system  stack,  a  carefully  crafted 
cross-layer  fault  injector  that  is  highly  accurate  and  fast,  a 
system-aware  design-space  exploration  that  allows  for 
quick  exploration  of  the  hundreds  of  possible  design  points, 
and  a  layout-aware  exploration  that  accounts  for  all  wire 
and  routing  considerations,  which  impact  overheads 
significantly,  for  key  designs.  Physical  design  overheads 
are  evaluated  using  a  28nm  TSMC  library  and  Synopsys 
design  tools  (Design  Compiler,  IC  compiler,  and 
PrimeTime)  to  perform  synthesis,  place-and-route,  and 
power  and  timing  analysis.  The  framework  automatically 
and  intelligently  inserts  resilience  into  baseline  designs  to 
provide  optimized  cross-layer  solutions  and  generates 
reports  detailing  the  achieved  reliability  as  well  as  the  area, 
power,  and  performance  overheads  of  the  resilient  design. 


Figure  1.  Our  cross-layer  framework 


Case  Study 

We  present,  as  a  case  study,  the  application  of  our  cross¬ 
layer  methodology  and  framework  on  the  Leon3.  For  the 
purposes  of  this  study,  we  focus  on  the  particular  case  of 
single  event  upsets  (SEU)  caused  by  soft  errors  (radiation- 
induced  transient  errors)  since  the  transient  nature  of  these 
types  of  errors  lead  to  interesting  challenges  such  as 
architectural  masking  factors  in  the  design. 

Although  we  only  present  results  for  Leon3  (a  simple,  in- 
order  core),  we  have  also  applied  our  framework  to  IVM  (a 
complex,  out-of-order  core)  to  reinforce  the  fact  that  our 
methodology  and  framework  is  applicable  to  any  system 
(ranging  from  simple  embedded  systems  to  complex 
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server-class  systems)  and  any  arbitrary  fault  model 
(permanent,  transient,  multi -bit,  etc.) 

System  Design  Analysis 

Using  flip-flop-level  fault  injection,  we  rank  the 
vulnerability  of  each  flip-flop  in  the  processor  in  terms  of 
its  likelihood  to  propagate  faults  [3].  This  allows  the 
designer  to  target  a  desired  level  of  fault  tolerance,  which 
guides  cross-layer  exploration.  Analysis  is  conducted  on  a 
diverse  set  of  benchmarks  including  the  SPECINT  2000 
and  DARPA  PERFECT  benchmarks.  We  evaluate  fully 
routed  designs  in  order  to  accurately  account  for  crucial 
physical  design  aspects  like  wire  routing. 

Resilience  Technique  Characterization 

Here,  we  briefly  summarize  some  of  the  resilience 
techniques  we  have  explored  and  illustrate  how  our 
characterization  affects  their  application  in  our  study. 

Circuit-level  Techniques:  At  the  circuit-level,  we  look  at 
radiation  hardened  flip-flops,  which  are  flip-flops  designed 
to  uphold  the  bit  representation  of  their  output  circuit  even 
under  particle  strikes  [1,  6,  10,  13].  We  look  at  a  range  of 
flip-flops  that  offer  tradeoffs  in  area  and  power  versus  the 
amount  of  resilience  provided  (Table  1). 


Table  1.  Hardened  flip-flop  comparison 


Type 

Soft  Error  Rate 

Area 

Power 

Delay 

Baseline 

1.0 

1.0 

1.0 

1.0 

Light 

2.5  x  10'1 

1.2 

1.1 

1.2 

Moderate 

5.0  x  10'2 

1.3 

1.5 

1.7 

Heavy 

2.0  x  10"4 

2.0 

1.8 

1.0 

Logic-level  Techniques:  At  the  logic-level,  we  explore 
logic  parity,  a  technique  which  detects  soft-errors  affecting 
groups  of  sequential  cell  elements  [9].  Naive 
implementations  will  result  in  significant  impact  on  clock 
frequency.  In  contrast,  by  implementing  pipelined  parity 
(accomplished  by  adding  extra  pipeline  flip-flops),  we 
maintain  the  original  design  frequency  (Fig.  2). 

However,  one  must  still  exercise  extreme  care  since  there 
are  many  different  parameters  to  use  for  optimization,  such 
as  parity  group  size,  flip-flop  vulnerability,  cell  locality, 
and  timing  slack.  Choosing  the  wrong  optimization 
parameter  can  lead  to  40%  area  and  80%  power  overhead 
as  compared  to  the  best  heuristic.  Our  framework  searches 
the  design  space  to  find  the  best  overall  solution. 


maintain  clock  period 

I  parity  group  (4-32  FF  size) 

'  ✓ 


Architecture-level  Techniques:  Most  micro-architectural 
techniques  have  high  cost  in  area  (e.g.,  redundancy)  or 
performance  (e.g.,  redundant  multi-threading).  One 
promising  technique  that  we  evaluate  is  Dataflow  Checking 
(DFC)  [8],  which  detects  errors  in  control  flow  as  well  as 
changes  in  operations  and  operands.  Hardware  overhead 
for  detection  is  negligible,  but  recovery  requires  buffering 
results  from  an  entire  basic  block,  making  the  overhead 
prohibitive  in  a  simple  processor  such  as  Leon3.  In  IVM, 
however,  the  reorder  buffer  allows  for  low-cost  recovery; 
but  DFC  covers  a  smaller  fraction  of  the  design. 
Quantification  of  these  tradeoffs  is  work  in  progress. 


Figure  3.  Transitions  between  basic  blocks  in  DFC 

Softw’are-level  Techniques:  We  analyze  two  Software- 
Implemented  Hardware  Fault  Tolerance  (SIHFT)  resilience 
techniques:  Control-Flow  Checking  by  SofhA’are  Signatures 
( CFCSS)  [11]  and  Error  Detection  by  Duplicated 
Instructions  (EDDT)  [12].  Although  these  techniques  do  not 
incur  area  and  power  overhead  from  additional  hardware, 
we  see  an  application  slowdown  of  on  average  1.5X  for 
CFCSS  and  2.2x  for  EDDI.  Furthermore,  these  software 
techniques  tend  to  cover  only  a  limited  range  of  flip-flops 
(around  20%). 

Application-specific  Techniques:  We  look  at  Algorithm 
Based  Fault  Tolerance  ( ABFT)  applied  to  select  PERFECT 
benchmarks  [5].  We  have  found  that  these  techniques 
generally  lie  in  the  ideal  region  of  high  error-detection  rate 
at  low  runtime  overhead.  They  only  provide  protection 
when  running  the  corresponding  algorithm,  so  in  general- 
purpose  processors,  hardware  protection  is  still  needed. 
However,  in  accelerators,  ABFT  can  eliminate  the  need  for 
other  fault-tolerance  mechanisms  in  dedicated  hardware,  or 
in  more  general-purpose  processors,  allow  hardware 
protection  to  be  turned  off  for  potential  power  savings 
when  the  protected  algorithms  are  running.  We  note  that, 
some  ABFT  techniques,  such  as  for  FFT,  only  detect  but 
cannot  correct  errors. 

Recovery  Considerations:  We  round  out  our  analysis  by 
examining  recovery  options  for  Leon3.  The  simplest  cross¬ 
layer  recovery  method  takes  advantage  of  the  built-in 
pipeline-flush  mechanism  (Fig.  4).  Errors  that  occur  and  are 
detected  before  reaching  the  memory  write  stage  of  the 
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pipeline  can  be  flushed  and  re-executed.  Implementing  this 
recovery  mechanism  costs  0.6%  area  and  0.9%  power. 
However,  it  does  require  that  the  last  stages  in  the  pipeline 
be  protected  using  techniques  that  both  detect  and  correct 
(e.g.,  hardened  flip-flops).  The  second  method  of  recovery 
we  evaluate  is  a  hardware  recovery  unit  (R-Unit),  which 
utilizes  instruction  checkpointing  [7]  (Fig.  5).  Although 
more  general,  this  unit  is  relatively  expensive  on  the  Leon3, 
requiring  16%  area  and  21%  power  overhead. 


M  Cross-layer  protected 
MU  Radiation-hardening  protected 
□  Recovery  hardware 


Figure  4.  Cross-layer  (flush)  recovery 
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m  Cross-layer  protected 
w*  Radiation-hardening  protected 
□  Recovery  hardware 


Figure  5.  R-unit  recovery 


Cross-Layer  Combinations 

We  examine  case  studies  on  the  Leon3  core  to  demonstrate 
systematic  cross-layer  exploration  using  our  framework. 


Circuit-  and  Logic-level  Combinations:  We  show  that 
careful  combination  of  hardening  and  parity  protection  for 
flip-flops,  allows  designs  that  achieve  significant  savings  as 
compared  to  a  single-layer  (hardening-only  or  parity-only) 
approach.  For  instance,  in  Leon3,  even  after  considering 
recovery,  our  cross-layer  solution  yields  a  savings  of 
anywhere  from  1.1-1.9X  in  area  and  l.l-1.5x  in  power 
(after  place  and  route)  as  compared  to  the  best  single-layer 
resilience  approach.  In  fact,  for  all  soft-error  rate  (SER) 
improvement  targets  (from  5-5,000x  improvement),  our 
framework  automatically  generates  a  lower  cost  solution 
than  the  best  single-layer  solution  available  (Fig.  6). 


Circuit-,  Logic-,  and  Architecture-level  Combinations: 
Using  our  two-layer  solution  as  a  baseline,  we  investigate 
whether  incorporating  additional  layers  into  our  cross-layer 
exploration  yields  further  savings.  We  look  at  Data  Flow 
Checking  (DFC),  an  architecture-level  technique.  For 
Leon3,  DFC  detects  errors  in  flip-flops  carrying  program 
counter  and  instruction  content  throughout  the  pipeline 
until  the  last  stage  (write -back)  where  the  check  is 
performed.  However,  these  flip-flops  lack  protection  during 
basic  blocks  following  branches  that  are  indirect  or  outside 
of  the  binaiy  address  range  (~20%  of  duration).  Moreover, 
some  flip-flops  only  have  errors  that  are  occasionally 
detected  by  DFC,  resulting  in  partial  detection.  Overall, 


35%  of  flip-flops  are  fully  or  partially  protected  with  a  50% 
average  detection  rate. 

As  a  result,  the  low  per-flip-flop  detection  requires 
significant  low-level  (hardening  and  parity)  protection, 
which  negates  the  benefits  of  using  DFC.  Moreover, 
coupled  with  the  high  design  cost  due  to  compiler 
modifications  and  the  corresponding  hardware  checker, 
high  area  cost  of  the  checker  hardware,  and  the  high 
(~20%)  performance  overheads,  we  find  that  cross-layer 
solutions  incorporating  this  specific  architecture-level 
technique  are  not  viable  solutions. 
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a)  Post-layout  area  overhead 
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b)  Post-layout  power  overhead 

Figure  6.  Post-layout  overhead  comparison  (circuit-  and 
logic-level  resilience) 

Circuit-,  Logic-,  and  Application-specific  Combinations: 
Let  us  consider  the  classical  case  of  adding  Algorithm 
Based  Fault  Tolerance  (ABFT)  correction  for  a  processor 
dedicated  to  inner  product.  We  find  that  a  cross-layer 
combination  actually  yields  anywhere  from  a  1 .8-3 .5  x  area 
improvement  (for  a  dedicated  processor)  and  1.4-3.7x 
power  improvement  (post-layout)  for  resilience 
improvements  in  the  range  of  5-100x  as  compared  to  our 
previous  best  two-layer  approach  (Fig.  7).  Keep  in  mind 
that  these  improvements  are  even  higher  when  we  compare 
to  the  best  single-layer  approach  instead. 

These  significant  benefits  are  achievable  because  ABFT  is 
actually  allowing  us  to  reduce  the  amount  of  low-level 
hardware  protection  required  (by  20-50%)  to  meet  our 
resilience  targets  However,  for  highly  resilient  design 
targets  (e.g.,  500-5,000x),  the  addition  of  ABFT  into  our 
cross-layer  analysis  yields  negligible  benefits.  Although 
ABFT  has  high  coverage  for  a  small  number  of  highly 
vulnerable  flip-flops,  residual  errors  of  non-fully  protected 
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flip-flops  will  still  require  low-level  hardware 
augmentation  in  order  to  achieve  a  highly-resilient  design. 
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Figure  7.  Post-layout  overhead  comparison  (circuit-,  logic-, 
and  application-specific  resilience) 

Circuit-,  Logic-,  Architecture-,  and  Application-specific 
Combinations:  Although  we  know  that  cross-layer 
combinations  of  DFC  (an  architecture-level  technique)  and 
low-level  hardware  resilience  (hardening  and  parity)  do  not 
yield  any  meaningful  benefits,  we  want  to  discover  if  this 
still  holds  when  we  incorporate  application-specific 
techniques  due  to  the  fact  that  these  two  higher-level 
techniques  could  potentially  complement  one  another. 

Unfortunately,  although  DFC  provides  some  minimal 
benefits  (i.e.,  the  number  of  flip-flops  requiring  low-level 
hardware  augmentation  drops  by  an  additional  10-20%), 
the  area  and  power  costs  of  the  DFC  checker  itself  negate 
any  benefits  afforded.  In  actuality,  this  cross-layer 
combination  could  be  significantly  worse,  because  DFC  (in 
the  case  of  an  embedded,  real-time  system)  will  incur 
tremendous  recovery  costs. 

Conclusions 

We  present  a  new  methodology  and  framework  for 
exploring,  evaluating,  and  optimizing  cross-layer  resilience. 
Our  methodology  and  framework  is  applicable  to  any 
arbitrary  design  and  for  any  arbitrary  use  case.  By  taking  a 
system-level  view  of  resilience,  we  can  systematically  find 
effective  cross-layer  combinations  to  achieve  desired 
resilience  at  much  lower  cost  than  the  best  possible  single¬ 
layer  solutions. 
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